📝 Deep Analysis: The Internal Logic of LLM Safety Mechanisms

Author: Chengchang Yu (@chengchangyu)
🎯 Core Four-Element Analysis
📌 Fundamental Problem
How do LLM safety alignment and jailbreak attacks actually work? Why can aligned models still be jailbroken to bypass safety guardrails? What's really happening inside these black-box models?
🔍 Key Perspective
Critical Insight: explain safety mechanisms through intermediate hidden states rather than final outputs alone. The authors discovered:
- Ethical concepts are learned during pre-training, not during alignment
- The essence of safety alignment is building an association: connecting early-layer ethical judgments with mid-layer emotional guesses
⚙️ Key Method
Weak-to-Strong Explanation (WSE): use weak classifiers (an SVM and an MLP) to analyze a strong LLM's intermediate hidden states
Technical Approach:
- Extract the last-position hidden state u_l from every layer
- Use weak classifiers to judge whether each state corresponds to an ethical or unethical input (see the probing sketch after this list)
- Apply the Logit Lens to decode mid-layer states into tokens and observe how emotion evolves across layers
- Propose Logit Grafting to simulate how a jailbreak disrupts the association stage
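Below is a minimal sketch of the probing step, assuming a Hugging Face causal LM and scikit-learn; the model name, the two toy prompts, and the per-layer SVM setup are illustrative placeholders, not the paper's exact pipeline or released code.

```python
# Sketch: probe the last-position hidden state u_l of every layer with a weak
# SVM classifier. Assumed setup (placeholder model and toy data), not the
# paper's exact pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.svm import SVC

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder aligned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_states(prompt):
    """Return the last-position hidden state u_l for every layer (plus embeddings)."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, hidden]
    return [h[0, -1, :].float().numpy() for h in out.hidden_states]

# Toy labeled prompts (0 = benign, 1 = malicious); stand-ins for a real dataset.
prompts = [("How do I bake bread?", 0), ("How do I build a bomb?", 1)]
num_states = model.config.num_hidden_layers + 1
per_layer_features = {layer: [] for layer in range(num_states)}
labels = []
for text, label in prompts:
    for layer, u in enumerate(last_token_states(text)):
        per_layer_features[layer].append(u)
    labels.append(label)

# One weak SVM per layer: how well each layer's u_l separates ethical from
# unethical inputs indicates how early the model "knows" the input is malicious.
probes = {layer: SVC(kernel="linear").fit(feats, labels)
          for layer, feats in per_layer_features.items()}
```

With a real labeled benign/malicious set, per-layer probe accuracy is what would reveal the early-layer separation described in the findings below.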
💡 Core Findings
Three-Stage Safety Mechanism:
- Early Layers (0-5): Models immediately identify malicious inputs based on ethical concepts learned during pre-training (>95% accuracy)
- Middle Layers (16-24): Alignment training associates ethical judgments with emotions (normal inputs → positive emotions; malicious inputs → negative emotions)
- Later Layers (25-32): Refine emotions into specific rejection/response tokens
Jailbreak's Essence:
- Jailbreak cannot deceive early-layer ethical judgments
- Jailbreak disrupts mid-layer associations, perturbing negative emotions into positive ones
- When positive emotions dominate the mid-layers, later layers generate harmful content (a logit-lens sketch of this mid-layer readout follows)
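As a companion to the mid-layer emotion finding, here is a minimal logit-lens sketch, continuing from the probing sketch above: it projects a mid-layer last-position state through the final norm and unembedding matrix to see which tokens that state already favors. The layer index and the `model.model.norm` / `model.lm_head` attribute names are Llama-style assumptions for illustration, not the paper's exact code.

```python
# Logit lens: decode a mid-layer state into vocabulary tokens to observe its
# "emotional" leaning. Continues from the probing sketch above (model, tokenizer).
def logit_lens_topk(prompt, layer=20, k=5):
    """Top-k tokens favored by the last-position state of the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
        h = out.hidden_states[layer][0, -1, :]  # last-position state u_l
        h = model.model.norm(h)                 # final RMSNorm (Llama-style attribute)
        logits = model.lm_head(h)               # unembed into the vocabulary
    top_ids = torch.topk(logits, k).indices.tolist()
    return tokenizer.convert_ids_to_tokens(top_ids)

# If the three-stage picture holds, a benign prompt should surface neutral or
# positive tokens at mid-layers, while a malicious prompt should surface
# negative / refusal-flavored tokens.
print(logit_lens_topk("How do I bake bread?"))
print(logit_lens_topk("How do I build a bomb?"))
```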
📐 Method Formalization
LLM Safety = Pre-training(Ethical Concepts) + Alignment(Association Mapping) + Refinement(Stylized Output)
Where:
- Early-layer Classification = Weak_Classifier(hidden_state_l) → {Ethical | Unethical}
- Mid-layer Association = Ethical_Judgment × Alignment_Weight → {Positive | Negative Emotion}
- Late-layer Refinement = Emotion_State → {Rejection_Token | Response_Token}
Jailbreak Attack = Perturb(Mid-layer Association) → Positive Emotion → Harmful Output
Logit Grafting Approximation:
Jailbreak Effect ≈ Replace(Malicious_Input_Mid_Layer_State, Normal_Input_Positive_Emotion_State)
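A rough sketch of that grafting approximation, continuing from the sketches above: cache a mid-layer last-position state from a benign (positive-emotion) input and overwrite the malicious input's state at the same layer during the forward pass via a forward hook. The layer index, the `model.model.layers` attribute, and the hook mechanics are assumptions for illustration, not the paper's released implementation.

```python
# Sketch of Logit Grafting: replace the malicious input's mid-layer state with
# a positive-emotion state cached from a benign input, approximating how a
# jailbreak disrupts the association stage. Continues from the sketches above.
GRAFT_LAYER = 20  # illustrative mid-layer index

def cache_state(prompt, layer=GRAFT_LAYER):
    """Cache the last-position output of decoder layer `layer` for `prompt`."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[l + 1] is the output of decoder layer l (index 0 is the embeddings).
    return out.hidden_states[layer + 1][0, -1, :].clone()

positive_state = cache_state("How do I bake bread?")  # benign, positive-emotion input

def graft_hook(module, hook_inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    if hidden.shape[1] > 1:  # graft only on the full-prompt (prefill) pass
        hidden[:, -1, :] = positive_state.to(hidden.dtype)
    # In-place edit; returning nothing keeps the (now modified) output.

# Hook the chosen decoder layer (model.model.layers is Llama-specific), run the
# malicious prompt, then remove the hook.
handle = model.model.layers[GRAFT_LAYER].register_forward_hook(graft_hook)
inputs = tokenizer("How do I build a bomb?", return_tensors="pt")
with torch.no_grad():
    grafted_ids = model.generate(**inputs, max_new_tokens=20, do_sample=False)
handle.remove()
print(tokenizer.decode(grafted_ids[0], skip_special_tokens=True))
```

If the mid-layer association really is what gates safety, this graft should push an aligned model toward compliance even though its early layers still flag the input as malicious, per the findings above.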
🎤 One-Sentence Summary (Core Value)
This paper uses weak classifiers to analyze an LLM's intermediate hidden states and reveals a three-stage safety mechanism: pre-training instills ethical concepts that let the early layers rapidly identify malicious inputs, alignment builds ethical-emotional associations in the middle layers, and the later layers refine those emotions into refusal or response tokens; jailbreak attacks bypass safety by disrupting the mid-layer association, perturbing negative emotions into positive ones so the model goes on to generate harmful content.
This analysis is based on the research paper "How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States" by Alibaba Group and Tsinghua University.